QC Overview

Row

A| QC distributions

B| Pairwise QC relationships

C| Mitochondrial content distribution

D| Cells Recovered: Pre- and Post-Filtering

Row

E| Pre- vs Post-Normalization Statistics

F| Transcripts per cell

G| Genes per cell

H| Variable gene plot

Row

Cells Recovered

40927

Genes per Cell (Median)

1060

Transcripts per Cell (Median)

2180

Filter and PCA Summary

Row

Filtering Statistics

PCA Scree Plots

Artifact Genes

UMAP

Row

Cell Clusters

QC Statistics

Cell Cycle

QC Tables

Row

orig.ident

Organism

Barcode

dataset

subset_group

SCT_snn_res.1

seurat_clusters

Phase

Variable Genes Table

Row

Log

Description Variable Name Value
Module 1, Preprocessing & QC
Date Sys.time() 2021-09-21 15:39:16
Input Cell Ranger input_data 16-Ochocka_2021_murine_GBM_immune-mctrl1, 17-Ochocka_2021_murine_GBM_immune-mctrl2, 18-Ochocka_2021_murine_GBM_immune-fctrl1, 19-Ochocka_2021_murine_GBM_immune-fctrl2, 20-Ochocka_2021_murine_GBM_immune-mtumor1, 21-Ochocka_2021_murine_GBM_immune-mtumor2, 22-Ochocka_2021_murine_GBM_immune-ftumor1, 23-Ochocka_2021_murine_GBM_immune-ftumor2
gene/cell Upper Limit (N/cell) gene.upperlimit 9000
gene/cell Lower Limit (N/cell) gene.lowerlimit 200
Filtered by unmatched reads unmatched.rate.filter.flag TRUE
Mitochondrial Content Upper Limit (%/cell) mt.upperlimit 10
Data Subsampled subsampled FALSE
Cluster Resolution cluster.resolution 1
Filtered by organism organism.filter.flag FALSE
Number of Workers normScale 1
Max Memory (Gb) max.memory 200
Variables regressed out vars2regress percent.mt
Input Species all_input_organisms Mm, Mm, Mm, Mm, Mm, Mm, Mm, Mm
Feature selection method feature.select.method hvg
Scaling method scale.method sct
PCA component selection method pca.component.select.method cum_var
Normalization Method norm_method SCT
PCA Method pca.method pca
Principal Components Included nDim 28
Unmatched rate upper limit unmatch.high 1
Unmatched rate lower limit unmatch.low 0
nPCA components used nDim 28
Run Time (s) elapsed.time 566.02
Run Identifier run.id R643_M01_NM2
User user NM2
Central Log Updated clog.update.success TRUE
Output Results (.Rdata) save.filename R643_M01_NM2_Ochocka2021_murineGBMimmune_210921.Rdata
Output Directory save.directory Preprocessed_Datasets/
Output Saved save.flag TRUE
---
title: "QC and Preprocessing"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    source_code: embed
    theme: flatly
    navbar:
      - { title: "scPipeline", href: "https://github.com/NMikolajewicz/scPipeline" }
      - { title: "scMiko", href: "https://github.com/NMikolajewicz/scMiko" }  
editor_options: 
  chunk_output_type: inline
knit: (function(inputFile, encoding) {
    rmarkdown::render(input = inputFile,
      encoding = encoding,
      output_file = if (exists("user")){paste0(
        xfun::sans_ext(inputFile), '_',user, "_", 
        paste0(format(Sys.time(), "%d_%b_"), gsub(" ", "_", gsub(":", "_", format(Sys.time(), "%X"))), format(Sys.time(), "_%Y")), '.html'
      )} else {paste0(xfun::sans_ext(inputFile), '_',"Guest", "_", 
      paste0(format(Sys.time(), "%d_%b_"), gsub(" ", "_", gsub(":", "_", format(Sys.time(), "%X"))), format(Sys.time(), "_%Y")), '.html')},
      output_dir = if (exists("data.path")) paste0(data.path, "/HTML_Reports") else NULL
    )
  })
---

```{r setup, include=FALSE}

# clear global enviroment
rm(list = setdiff(ls(), c("data.path", "user")))
invisible({gc()})

# initiate timer
start.time <- proc.time()

# list of packages to load
packages2load <- c('viridis', 'scMiko', 'ggpubr', 'dplyr', 'WGCNA', 'cowplot', 'schex', 'Seurat', 'sctransform', 'plyr', 'tidyr', 'reshape2', 'Matrix', 'RColorBrewer', 'ggplot2', 'gridExtra', 'DT', 'flexdashboard', 'future', 'gplots', 'reticulate')

# load packages
invisible(lapply(packages2load, library, character.only = TRUE, quietly = T))

```


```{r c2 - parameter_specification}

# specify parameters
parameter.list <- list(
  which.data = c("16-O", "17-O", "18-O", "19-O", "20-O", "21-O", "22-O", "23-O"),
  which.strata = NA, #"wt"
  organism.filter.flag = F,                             
  organism.include = "Mm",
  save.flag = T,                                                              # OPTIONAL; logical (default = T)  # save results
  save.filename = "Ochocka2021_murineGBMimmune_210921.Rdata", # note that prefix containing module number and run ID is appended to save file. 
  save.directory = "Preprocessed_Datasets/",
  cluster.resolution = 1,
  gene.upperlimit = 9000, #9000
  gene.lowerlimit = 200,  #200
  mt.upperlimit = 10,   #60
  unmatched.rate.filter.flag = T,
  vars2regress = c("percent.mt"), # optional: "dataset" (for linear batchcorrection)
  correct.artifact = T,
  feature.select.method = "hvg", # options: hvg (recommended), deviance
  scale.method = "sct", # option: sct (recommended), null_residuals
  pca.method = "pca", #option: pca (recommended), glmpca
  pca.component.select.method = "cum_var", #options" cum_var, elbow
  pca.var.threshold = 0.9, # [0,1], ignored if pca.component.select.method = elbow
  conserve.memory = F,
  update.log = T,
  save.pdf = T,
  print.inline = F
)

# save.pdf <- T
n.workers <- list(
  normScale = 1
)
max.memory <- 200 # numeric, in terms of Gb (set to 100 for cluster)

```


```{r get input data paths}

if (!exists("data.path") & !exists("user")) {
  stop("data.path and user variables do not exist in global enviroment. Ensure that these are specified in .Rprofile. If specified, try restarting Rstudio.")
}

# get input data files paths
inputDataSpecification <- function(inputs = "M01_inputs.csv", subsets = "M01_subsets.csv"){
  
  input.data <- read.csv(inputs, stringsAsFactors = F)
  colnames(input.data) <- rmvCSVprefix(colnames(input.data))
  input.data <- apply(input.data, 2, function(x) gsub("Â", "", x))
  input.data <- as.data.frame(input.data)
  
  # get subsets
  input.subset <- read.csv(subsets, stringsAsFactors = F)
  
  # wrangle into list
  u.samples <- unique(input.subset$sample)
  input.barcode.strata <- list()
  for (i in 1:length(u.samples)){
    
    subset.strata <- input.subset[input.subset$sample %in% u.samples[i], ]
    current.strata <- list()
    
    for (j in 1:nrow(subset.strata)){
      
      current.barcodes <- subset.strata$barcodes[j]
      current.barcodes <- strsplit(current.barcodes, ", ")[[1]]
      current.strata[[subset.strata$set[j]]] <- current.barcodes
    }
    
    input.barcode.strata[[u.samples[i]]] <- current.strata
    
  }
  
  return(list(
    input.data = input.data,
    input.barcode.strata = input.barcode.strata
  ))
}

input.list <- inputDataSpecification()
input.data <- input.list$input.data
input.barcode.strata <- input.list$input.barcode.strata

```

```{r c3 - additional specifications and assertions}

# Specify input data files
which.data <- parameter.list$which.data  # REQUIRED 

if ("which.strata" %in% names(parameter.list)){
  which.strata <- parameter.list$which.strata # NA if no stratification
} else {
  which.strata <- NA
}


# specify input 
stopifnot(all(which.data %in% as.vector(input.data$sample)))
specified.input <- input.data[as.vector(input.data$sample) %in% which.data, ]


# get input data specifications
input_data <- as.vector(specified.input$files)
input_pcr_barcodes <- as.vector(specified.input$pcr.barcodes)
input_rt_barcodes <- as.vector(specified.input$rt.barcodes)
input_set <- c(input_data, input_pcr_barcodes, input_rt_barcodes)
input.type <- as.vector(specified.input$input.type)
all_input_organisms <- as.vector(specified.input$species)
dir <- as.vector(specified.input$directory) 

# ensure correct path is used for data input
if (exists("data.path")){
  dir <- paste0(data.path, dir)
}

# ensure regression variables are correctly specified
if (length(which.data) == 1){
  parameter.list$vars2regress <- parameter.list$vars2regress[!(parameter.list$vars2regress %in% "dataset")]
}


# get stratification details
if (!is.na(which.strata)){
  which.strata.specified <- which.strata
  which.strata <- c()
  for (i in 1:length(which.strata.specified)){
   which.strata <- c(which.strata, input.barcode.strata[[which.data[i]]][[which.strata.specified[i]]])
  }
} 

# ensure correct directory is specified
stopifnot(exists("input.type")) 

# ensure input data are specified
stopifnot(exists("input_set")) 

try({
 parameter.list$organism.include <- rep(parameter.list$organism.include, length(input_data))
}, silent = T)

# if set_names are not specified, set_names <-  ""
if (exists("parameter.list$set_names")){
  stopifnot(is.character(parameter.list$set_names))
} else {parameter.list$set_names <- ""}

# if pilot input are not specified, pilot_input <-  ""
if (exists("pilot_input")){
  stopifnot(length(pilot_input) == 1)
} else {pilot_input <- ""}

total_sets <- input_set

```


```{r c4 - analysis log}

df.log <- initiateLog("1, Preprocessing & QC")

if (input.type == 1){
  df.log <- addLogEntry("Input Data (.Rdata)", input_data, df.log, "input_data")
  df.log <- addLogEntry("Input RT Barcode (.txt)", input_rt_barcodes, df.log, "input_rt_barcodes")
  df.log <- addLogEntry("Input PCR Barcode (.txt)", input_pcr_barcodes, df.log, "input_pcr_barcodes")
} else if (input.type == 2){
  df.log <- addLogEntry("Input Cell Ranger", input_data, df.log, "input_data")
} else if (input.type == 3){
  df.log <- addLogEntry("Input TPM Matrix", input_data, df.log, "input_data")
}


df.log <- addLogEntry("gene/cell Upper Limit (N/cell)", as.character(parameter.list$gene.upperlimit), df.log, "gene.upperlimit")
df.log <- addLogEntry("gene/cell Lower Limit (N/cell)", as.character(parameter.list$gene.lowerlimit), df.log, "gene.lowerlimit")

df.log <- addLogEntry("Filtered by unmatched reads", 
                      as.character(parameter.list$unmatched.rate.filter.flag), df.log, "unmatched.rate.filter.flag")

df.log <- addLogEntry("Mitochondrial Content Upper Limit (%/cell)", as.character(parameter.list$mt.upperlimit), df.log, "mt.upperlimit")

if ("subsample_factor" %in% names(parameter.list)){
  if (parameter.list$subsample_factor == 1){subsampled <- FALSE} else {subsampled <- TRUE}
} else {
  parameter.list$subsample_factor <- 1
  subsampled <- FALSE
}

df.log <- addLogEntry("Data Subsampled", subsampled, df.log, "subsampled")

df.log <- addLogEntry("Cluster Resolution", as.character(parameter.list$cluster.resolution), df.log, "cluster.resolution")
df.log <- addLogEntry("Filtered by organism", as.character(parameter.list$organism.filter.flag), df.log, "organism.filter.flag")

if (parameter.list$organism.filter.flag) {
  df.log <- addLogEntry("Species Included", as.character(parameter.list$organism.include), df.log, "organism.include")
}

df.log <- addLogEntry("Number of Workers", as.character(n.workers$normScale), df.log, "normScale")
df.log <- addLogEntry("Max Memory (Gb)", as.character(max.memory), df.log, "max.memory")

if (!is.na(which.strata)) {
  df.log <- addLogEntry("Barcode(s) Included", which.strata, df.log, "which.strata")
}

df.log <- addLogEntry("Variables regressed out", as.character(parameter.list$vars2regress), df.log, "vars2regress")
df.log <- addLogEntry("Input Species", all_input_organisms, df.log, "all_input_organisms")

df.log <- addLogEntry("Feature selection method", parameter.list$feature.select.method, df.log, "feature.select.method")
df.log <- addLogEntry("Scaling method", parameter.list$scale.method, df.log, "scale.method")
df.log <- addLogEntry("PCA component selection method", parameter.list$pca.component.select.method, df.log, "pca.component.select.method")



```

```{r}

loadMoffat.dev <- function(import_set, subsample_factor, input_organisms, organism_include, dir) {

  # load gene count matrix
  import_set_path <- paste(dir, import_set, sep ="")
  load(import_set_path[1])

  # assign row (genes) and column (cells) names
  gene_count2 <- gene_count



  # filter out incorrect species genes immediately.
  all.genes <- rownames(gene_count2)
  include.which.genes <- rep(FALSE, length(all.genes))
  if ((length(input_organisms) > length(organism_include)) | (length(organism_include) == 1)){
    for (i in 1:length(organism_include)){
      if (organism_include[i] == "Mm"){
        include.which.genes[grepl("MUSG", all.genes)] <- T
      } else if (organism_include[i] == "Hs"){
        include.which.genes[!(grepl("MUSG", all.genes))] <- T
      }
    }

    df_gene <- df_gene[include.which.genes, ]
    gene_count <- gene_count[include.which.genes, ]
    gene_count2 <- gene_count2[include.which.genes, ]
  }
  
    # Infer organism
  # orgs <- scMiko::inferSpecies(gene_count2, input_organisms)
  orgs <-  detectSpecies(gene_count2)
  df_cell$orgs <- orgs

  # only keep protein coding genes

  # assign gene and cell names
  rownames(gene_count2) <- df_gene$gene_id
  colnames(gene_count2) <- df_cell$sample


  # subsample data
  if (subsample_factor < 1){
    col_ind <- c(1:dim(gene_count2)[2])
    subsample_ind <- sample(col_ind, round(subsample_factor * dim(gene_count2)[2]), replace = FALSE)
    gene_count2 <- gene_count2[,subsample_ind]
  } else {
    subsample_ind <- c(1:dim(gene_count2)[2])
  }

  # Need names vectors to add additional metadata to Seurat object
  gNames <- as.character( df_gene$gene_name );
  names(gNames) <- as.character( df_gene$gene_id );
  names(gNames) <-gsub("\\\\..*","",as.vector( names(gNames)))

  # Add PCR barcodes
  pcr.barcode.flag <- F
  if (dir != import_set_path[2]){
    try({
      grps <- read.csv(import_set_path[2], header=T,sep="\\t",stringsAsFactors=F);
      wells <- gsub("Han_[0-9]+_([A-Z0-9]{2,3}).[ACGT]+$","\\\\1",colnames(gene_count2));
      myGrps <- grps$ConditionGroup[ match(wells,grps$sampleWell) ];
      names(myGrps) <- colnames(gene_count2)
      myGrps <- myGrps[subsample_ind]
      pcr.barcode.flag <- T
    }, silent = T)
  }


  # Add RT barcode
  rtBC <- tryCatch({
    rtBC <- read.csv(import_set_path[3],header=T,sep="\\t",stringsAsFactors=F);
  }, error = function(e){
    rtBC <- read.csv2(import_set_path[3],header=T,sep="\t",stringsAsFactors=F)
    return(rtBC)
  })
  rtBC <- rtBC[subsample_ind, ]
  if (is.null(rtBC$Plate)) rtBC$Plate <- ""
  rtBC$plate.well.sample <- paste0("P", rtBC$Plate, ".", rtBC$Well, ".", rtBC$Sample.type)


  barcodes <- gsub(".+([ACGT]{10}$)","\\\\1",colnames(gene_count2));
  if (unique(barcodes) == "\\1")  barcodes <- gsub(".+([ACGT]{10}$)","\\1",colnames(gene_count2));
  idx <- match( barcodes,rtBC$rt.bc );
  rtGrp <- rtBC$Sample.type[idx];
  names(rtGrp) <- colnames(gene_count2);
  rtPWS <- data.frame(
    plate = rtBC$Plate[idx],
    well = rtBC$Well[idx],
    sample = rtBC$Sample.type[idx])
  rtPWS.summary <- rtPWS %>% dplyr::group_by(plate, well, sample) %>% dplyr::tally()

  # remove .* suffix in ensembl
  rownames(gene_count2)<-gsub("\\\\..*","",as.vector( rownames(gene_count2)))

  # Create seurat object with count matrix data
  so <- CreateSeuratObject(counts=gene_count2,project="Hong-sciSeq3",min.cells=0,min.features=0,names.field=2,names.delim="." )

  # Add gene symbols as meta data that we can use later
  mat_ens <- rownames(so@assays[["RNA"]])
  match.id <- match(mat_ens,  names(gNames))

  so[["RNA"]] <- AddMetaData( object=so[["RNA"]],metadata=names(gNames)[match.id],col.name="ENSEMBL");
  so[["RNA"]] <- AddMetaData( object=so[["RNA"]],metadata=as.vector((gNames)[match.id]),col.name="SYMBOL");

  # assign meta data
  so.meta <- so@meta.data
  so.meta$ind <- seq(1, nrow(so.meta))
  so.meta$sample <- rownames(so.meta)
  so.merge <- merge(so.meta, df_cell, by = "sample", all.x = T)
  so.merge <- so.merge %>% dplyr::arrange(ind)
  if ("unmatched_rate" %in% colnames(df_cell))  so[["unmatched.rate"]] <- so.merge$unmatched_rate
  so[["Organism"]] <- so.merge$orgs

  if (length(import_set) > 1){ # same as above, if barecode and well info. provided, assign to so structure
    # Add group #s for mapping cells to PCR condition groups
    if (pcr.barcode.flag) {
      so[["group"]] <- myGrps;
    } else {
      so[["group"]] <- "unspecified";
    }

    # Add in cell line group info
    so$Barcode <- rtGrp;

  }

  try({so@misc[["gene.info"]] <- df_gene}, silent = T)
  try({so@misc[["cell.info"]] <- df_cell}, silent = T)
  try({so@misc[["plate.summary"]] <- rtPWS.summary}, silent = T)

  # return output
  output <- list(so, gNames)

  return(output)
}


```


```{r}
# 
# function(import_set, input_organisms, dir = "") {
# 
#   import_set_path <- paste(dir, import_set, sep ="")
# 
#   # import cell ranger-processed data
#   expression_matrix <- Seurat::Read10X(data.dir = import_set_path[1])
# 
#   # file.path(import_set_path[1], "barcodes.tsv")
#   # list.files(path = import_set_path[1], pattern=NULL, all.files=FALSE,
#   #   full.names=FALSE)
# 
#   # assign to secondary matrix
#   expression_matrix2 <- expression_matrix
# 
#   # import gene and ensembl names
#   load.success <- F
#   try({
#     feature.names = read.delim(paste(import_set_path[1], "/genes.tsv", sep = ""),
#                                header = FALSE,
#                                stringsAsFactors = FALSE)
#     load.success <- T
#   }, silent = T)
#   if (!load.success){
#     feature.names = read.delim(paste(import_set_path[1], "/features.tsv", sep = ""),
#                                header = FALSE,
#                                stringsAsFactors = FALSE)
#   }
# 
# 
#   # remove hg19 tags
#   feature.names[,1] <- gsub("hg19_", "", feature.names[,1])
#   feature.names[,2] <- gsub("hg19_", "", feature.names[,2])
# 
#   # create gene list
#   gNames <- as.character( feature.names$V2 );
#   names(gNames) <- as.character( feature.names$V1 );
#   names(gNames) <-gsub("\\..*","",as.vector( names(gNames)))
# 
#   # convert gene symbols to ensemble (in expression matrix)
#   rownames(expression_matrix2) <- names(gNames)
# 
#   # assign species
#   orgs <- detectSpecies(expression_matrix2)
#   # orgs <- scMiko::m1.inferSpecies(expression_matrix2, input_organisms)
# 
#   # create seurat object
#   # so = CreateSeuratObject(counts = expression_matrix2,project="cell_ranger",min.cells=3,min.features=200)
#   so = CreateSeuratObject(counts = expression_matrix2,project="cell_ranger",min.cells=0,min.features=0)
# 
#   # add gene symbols as meta data in seurat object
#   mat_ens <- rownames(so@assays[["RNA"]])
#   match.id <- match(mat_ens, names(gNames))
#   # gNames_filtered <- gNames[match.id]
# 
#   # Add gene symbols as meta data that we can use later
#   # so[["RNA"]] <- AddMetaData( object=so[["RNA"]],metadata=gNames_filtered,col.name="gene_name");
#   so[["RNA"]] <- AddMetaData( object=so[["RNA"]],metadata=names(gNames[match.id]),col.name="ENSEMBL");
#   so[["RNA"]] <- AddMetaData( object=so[["RNA"]],metadata=as.vector(gNames[match.id]),col.name="SYMBOL");
# 
#   # Add in inferred organism
#   so$Organism <- orgs;
# 
#   # Specify barcodes
#   so$Barcode <- "unspecified";
# 
#   output <- list(so, gNames)
# 
#   return(output)
# }
# 

```

```{r c6 - load_data}

# Run import function
so.list <- list()
gNames.list.list <- list()

# sequence indices
idx.seq <- c(1, 1+length(input_data), 1+(2*length(input_data)))

# assertions
stopifnot(length(dir) == length(input_data))
stopifnot(length(all_input_organisms) == length(input_data))

for (i in 1:length(input_data)){
  
  
  cur.total_sets <- total_sets[idx.seq + (i-1)]
  
  if (input.type[i] == 1) {
    output.list <- loadMoffat(import_set = cur.total_sets, 
                              subsample_factor = parameter.list$subsample_factor, 
                              input_organisms = stringr::str_trim(unlist(strsplit(all_input_organisms[i], ",")))  , 
                              organism_include = parameter.list$organism.include[i], 
                              dir = dir[i])
  } else if (input.type[i] == 2){
    output.list <- loadCellRanger(import_set = cur.total_sets, 
                                  input_organisms = stringr::str_trim(unlist(strsplit(all_input_organisms[i], ","))), 
                                  dir = dir[i])
  } else if (input.type[i] == 3){
    
    output.list <- loadMat(import_set = cur.total_sets, 
                           input_organisms = stringr::str_trim(unlist(strsplit(all_input_organisms[i], ","))), 
                           dir =  dir[i])
    
  }
  
  # unlist outputs
  so.list[[i]] <- output.list[[1]]
  gNames.list.list[[i]] <- output.list[[2]]
  
  if ((unique(so.list[[i]]@meta.data[["Barcode"]]) == "unspecified") ){
    so.list[[i]]@meta.data[["Barcode"]] <- paste("T", i, sep = "")
  } else if (is.na(unique(so.list[[i]]@meta.data[["Barcode"]]))){
    stop("Barcodes incorrectly specified") #140620
  }
  
}

# assign set labels and merge seurat objects if multiple present
for (i in 1:length(so.list)){
  if (length(which.data) == length(which.strata)){
    set.label <- paste0(which.data[i],"_", which.strata[i])
  } else {
    set.label <- which.data[i]
  }
  so.list[[i]][["dataset"]] <- set.label
}

so <- mergeSeuratList(so.list)
rm(so.list);
rm(output.list);
invisible({gc()})

# Assumed that all gene lists are identical
gNames.list <- gNames.list.list[[1]]

```



```{r c8 - filter_data_by_subset, include = FALSE}

# define subset parameters
if ("organism.include" %in% names(parameter.list)){
  if (sum(c("Mm", "Hs") %in%  parameter.list$organism.include) > 0){
    subset.df <- data.frame(field = "Organism", subgroups = parameter.list$organism.include)
  }
  
  # subset data
  so <- subsetSeurat(so, subset.df)
} 


```




```{r visualize plate layout, fig.width=8, fig.height=6}

visualize.plate <- F
if (visualize.plate){
  
  so.meta <- so@meta.data

myLetters <- toupper(letters[1:26])

so.meta$well <- stringr::str_extract(rownames(so.meta), "[A-Z0-9]*")
so.meta$row <- match(stringr::str_extract(so.meta$well, "[A-Z]*"), myLetters)
so.meta$col <- as.numeric(stringr::str_remove(so.meta$well, "[A-Z]*"))
so.meta$bc <- gsub("\\.", "", stringr::str_remove(rownames(so.meta), "[A-Z0-9_]*"))

# import 384 ligation bc
df.384lig <- readxl::read_xlsx("C:/Users/Owner/Dropbox/PDF Projects - JM/Data/scRNA-seq/01_sci-RNA-seq3_Hong_Kevin_Jason/ReportFolder/sciRNAseq3_RunHH14_210729/siRNAseq3_Ligation_barcodes_090821.xlsx")


# so.meta <- so.meta %>% dplyr::filter(grepl("L1_|L2_|L3_", so.meta$Barcode))
so.meta <- so.meta %>% dplyr::filter(grepl("08_", so.meta$Barcode))

table(so.meta$Barcode)

# dat  <- so.meta
 # row.col = "row"; col.col = "col"; value.col = "nCount_RNA"
asPlate <- function(dat, row.col = "row", col.col = "col", value.col = "nCount_RNA", aggregate.function = "sum"){
  

  
  if (value.col == "n_nuclei"){
      # dat <- dat %>% dplyr::select(c(row.col, col.col, value.col))
  dat$col.val <- dat[ ,col.col]
  dat$row.val <- dat[ ,row.col]
  
    u.row <- unique(dat [,row.col]); u.row <- u.row[order(u.row)]
  u.col <- unique(dat [,col.col]); u.col <- u.col[order(u.col)]
    dat <- dat %>%
      dplyr::group_by(col.val, row.val) %>%
      dplyr::summarise(value = length(col.val))
  } else {
      dat <- dat %>% dplyr::select(c(row.col, col.col, value.col))
  dat$col.val <- dat[ ,col.col]
  dat$row.val <- dat[ ,row.col]
  
    u.row <- unique(dat [,row.col]); u.row <- u.row[order(u.row)]
  u.col <- unique(dat [,col.col]); u.col <- u.col[order(u.col)]
    dat$value <- dat[ ,value.col]
  }

  agg.func <- sum
  if (aggregate.function == "sum"){
    agg.func <- sum
  } else if (aggregate.function == "mean"){
    agg.func <- mean
  } else if (aggregate.function == "median"){
    agg.func <- median
  }
  
  
  # if (value.col)

  dat.sum <- dat %>%
    dplyr::group_by(row.val, col.val) %>%
    dplyr::summarise(x=  agg.func(value, na.rm = T))
  
  dat.sum$row.val <- factor(dat.sum$row.val, levels = u.row)
  dat.sum$col.val <- factor(dat.sum$col.val, levels = u.col)
  
  dat.sum %>%
    ggplot(aes(y = (row.val), x = (col.val), fill = x)) + 
    geom_tile() + 
    labs(title = paste0(value.col, "/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = value.col, 
         caption = paste0("Aggregation Function: ", aggregate.function)) + 
    viridis::scale_fill_viridis() 
  
}


plt.plate.umi <-asPlate(dat = so.meta, row.col = "row", col.col = "col", value.col = "nCount_RNA", aggregate.function = "mean")
plt.plate.gene <-asPlate(dat = so.meta, row.col = "row", col.col = "col", value.col = "nFeature_RNA", aggregate.function = "mean")
plt.plate.ur <-asPlate(dat = so.meta, row.col = "row", col.col = "col", value.col = "unmatched.rate", aggregate.function = "mean")
plt.plate.nuclei <-asPlate(dat = so.meta, row.col = "row", col.col = "col", value.col = "n_nuclei")



plt.plate <- cowplot::plot_grid(plt.plate.umi, plt.plate.gene, plt.plate.ur, plt.plate.nuclei, align = "hv")

plt.plate
table(so.meta$Barcode)


}


# savePDF("p14_08Mix_hNP_plateStatistics_visualized_090821.pdf", plt.plate, fig.width=8, fig.height=6)


```



```{r visualize RT plate layout, fig.width=16, fig.height=6}

visualize.plate <- F
if (visualize.plate){
  
  so.meta <- so@meta.data

myLetters <- toupper(letters[1:26])

so.meta$well <- stringr::str_extract(rownames(so.meta), "[A-Z0-9]*")
so.meta$row <- match(stringr::str_extract(so.meta$well, "[A-Z]*"), myLetters)
so.meta$col <- as.numeric(stringr::str_remove(so.meta$well, "[A-Z]*"))
so.meta$bc <- gsub("\\.", "", stringr::str_remove(rownames(so.meta), "[A-Z0-9_]*"))

 # Add RT barcode
  df.384lig <- tryCatch({
    rtBC <- read.csv("C:/Users/Owner/Dropbox/PDF Projects - JM/Data/scRNA-seq/01_sci-RNA-seq3_Hong_Kevin_Jason/ReportFolder/sciRNAseq3_RunHH14_210729/Run14_rt_barcodes_20210730NewFormat_288used.txt",header=T,sep="\\t",stringsAsFactors=F);
  }, error = function(e){
    df.384lig <- read.csv2("C:/Users/Owner/Dropbox/PDF Projects - JM/Data/scRNA-seq/01_sci-RNA-seq3_Hong_Kevin_Jason/ReportFolder/sciRNAseq3_RunHH14_210729/Run14_rt_barcodes_20210730NewFormat_288used.txt",header=T,sep="\t",stringsAsFactors=F)
    return(df.384lig)
  })

so.meta <- so.meta %>% dplyr::filter(grepl("08_", so.meta$Barcode))


u.plates <- unique(df.384lig$Plate)
u.well <- unique(df.384lig$Well)

for (i in 1:length(u.plates)){
  plate.name <- u.plates[i]
  for (j in 1:length(u.well)){
    well.name <- u.well[j]
    which.ind <- which((df.384lig$Plate == plate.name) & (df.384lig$Well == well.name))
    bc.name <- df.384lig$rt.bc[which.ind]  
    which.match <- which(grepl(bc.name, so.meta$bc))
    df.384lig$n.nuclei[which.ind] <- length(which.match)
    df.384lig$nCount[which.ind] <- mean(so.meta$nCount_RNA[which.match], na.rm = T)
    df.384lig$nFeature[which.ind] <- mean(so.meta$nFeature_RNA[which.match], na.rm = T)
    df.384lig$unmatch_rate[which.ind] <- mean(so.meta$unmatched.rate[which.match], na.rm = T)
  }
}

df.384lig$row <- match(stringr::str_extract(df.384lig$Well, "[A-Z]*"), myLetters)
df.384lig$col <- as.numeric(stringr::str_remove(df.384lig$Well, "[A-Z]*"))

    u.row <- unique(df.384lig$row); u.row <- u.row[order(u.row)]
  u.col <- unique(df.384lig$col); u.col <- u.col[order(u.col)]


 df.384lig$row <- factor(df.384lig$row, levels = u.row)
  df.384lig$col <- factor(df.384lig$col, levels = u.col)

plt.plate.nuclei <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = n.nuclei)) +
    geom_tile() +
    labs(title = paste0("n.nuclei/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "n.nuclei") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~Plate)

plt.plate.umi <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = nCount)) +
    geom_tile() +
    labs(title = paste0("nCount_RNA/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "nCount_RNA") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~Plate)

plt.plate.gene <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = nFeature)) +
    geom_tile() +
    labs(title = paste0("nFeature_RNA/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "nFeature_RNA") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~Plate)

plt.plate.ur <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = unmatch_rate)) +
    geom_tile() +
    labs(title = paste0("unmatched.rate/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "unmatched.rate") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~Plate)

plt.plate <- cowplot::plot_grid(plt.plate.umi, plt.plate.gene, plt.plate.ur, plt.plate.nuclei, align = "hv")

plt.plate
# table(so.meta$Barcode)


}

# savePDF("p14_08Mix_hNP_plateStatistics_visualized_090821.pdf", plt.plate, fig.width=8, fig.height=6)
# savePDF("p14_08Mix_RTPlate_visualized_090821.pdf", plt.plate, fig.width=16, fig.height=6)
```



```{r visualize ligation plate layout, fig.width=16, fig.height=12}

visualize.plate <- F
if (visualize.plate){
  
  so.meta <- so@meta.data

myLetters <- toupper(letters[1:26])

so.meta$well <- stringr::str_extract(rownames(so.meta), "[A-Z0-9]*")
so.meta$row <- match(stringr::str_extract(so.meta$well, "[A-Z]*"), myLetters)
so.meta$col <- as.numeric(stringr::str_remove(so.meta$well, "[A-Z]*"))
so.meta$bc <- gsub("\\.", "", stringr::str_remove(rownames(so.meta), "[A-Z0-9_]*"))

so.meta <- so.meta %>% dplyr::filter(grepl("37_|38_", so.meta$Barcode))

# import 384 ligation bc
df.384lig <- readxl::read_xlsx("C:/Users/Owner/Dropbox/PDF Projects - JM/Data/scRNA-seq/01_sci-RNA-seq3_Hong_Kevin_Jason/ReportFolder/sciRNAseq3_RunHH14_210729/siRNAseq3_Ligation_barcodes_090821.xlsx")

u.plates <- unique(df.384lig$`Plate Name`)
u.well <- unique(df.384lig$`Well Position`)

for (i in 1:length(u.plates)){
  plate.name <- u.plates[i]
  for (j in 1:length(u.well)){
    well.name <- u.well[j]
    which.ind <- which((df.384lig$`Plate Name` == plate.name) & (df.384lig$`Well Position` == well.name))
    bc.name <- df.384lig$LigBarcodes...10[which.ind]  
    which.match <- which(grepl(bc.name, so.meta$bc))
    df.384lig$n.nuclei[which.ind] <- length(which.match)
    df.384lig$nCount[which.ind] <- mean(so.meta$nCount_RNA[which.match], na.rm = T)
    df.384lig$nFeature[which.ind] <- mean(so.meta$nFeature_RNA[which.match], na.rm = T)
    df.384lig$unmatch_rate[which.ind] <- mean(so.meta$unmatched.rate[which.match], na.rm = T)
  }
}

df.384lig$row <- match(stringr::str_extract(df.384lig$`Well Position`, "[A-Z]*"), myLetters)
df.384lig$col <- as.numeric(stringr::str_remove(df.384lig$`Well Position`, "[A-Z]*"))

    u.row <- unique(df.384lig$row); u.row <- u.row[order(u.row)]
  u.col <- unique(df.384lig$col); u.col <- u.col[order(u.col)]


 df.384lig$row <- factor(df.384lig$row, levels = u.row)
  df.384lig$col <- factor(df.384lig$col, levels = u.col)

plt.plate.nuclei <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = n.nuclei)) +
    geom_tile() +
    labs(title = paste0("n.nuclei/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "n.nuclei") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~`Plate Name`)

plt.plate.umi <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = nCount)) +
    geom_tile() +
    labs(title = paste0("nCount_RNA/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "nCount_RNA") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~`Plate Name`)

plt.plate.gene <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = nFeature)) +
    geom_tile() +
    labs(title = paste0("nFeature_RNA/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "nFeature_RNA") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~`Plate Name`)

plt.plate.ur <- df.384lig %>%
    ggplot(aes(y = (row), x = (col), fill = unmatch_rate)) +
    geom_tile() +
    labs(title = paste0("unmatched.rate/well"), subtitle = "96 well plate", x = "columns (1-12)", y = "rows (A-H)", fill = "unmatched.rate") + 
    viridis::scale_fill_viridis() + 
    facet_wrap(~`Plate Name`)

plt.plate <- cowplot::plot_grid(plt.plate.umi, plt.plate.gene, plt.plate.ur, plt.plate.nuclei, align = "hv")

plt.plate
# table(so.meta$Barcode)


}

# savePDF("p14_08Mix_hNP_plateStatistics_visualized_090821.pdf", plt.plate, fig.width=8, fig.height=6)
# savePDF("p14_PR_GBM_ligationPlate_visualized_090821.pdf", plt.plate, fig.width=16, fig.height=12)
```




```{r c9 - subset barcodes, include = FALSE}
so <- barcodeLabels(so, which.strata)

```


```{r c10 - get_mt_content, include = FALSE}

# MITOCHONDRIAL CONTENT
so <- getMitoContent(so, gNames.list)

```


```{r c11 - barcode rankings, fig.width= 15, fig.height=5, include = FALSE}

# get metadata
df.meta.qc <- so@meta.data

# rank QC metrics
df.meta.qc$rank.umi <- rank(-df.meta.qc$nCount_RNA)
df.meta.qc$rank.gene <- rank(-df.meta.qc$nFeature_RNA)
df.meta.qc$rank.mt <- rank(-df.meta.qc$percent.mt)

df.meta.qc$mitoOmit <- df.meta.qc$percent.mt > parameter.list$mt.upperlimit
df.meta.qc$mitoColor <- "black"
df.meta.qc$mitoColor[df.meta.qc$mitoOmit] <- "tomato"

plt.gene.umi.mito <- df.meta.qc %>%
  ggplot(aes(x = nCount_RNA, y = (nFeature_RNA))) + 
  geom_point(color = df.meta.qc$mitoColor) + 
  scale_y_continuous(trans='log10') + 
  scale_x_continuous(trans='log10') + 
  geom_hline(yintercept = parameter.list$gene.upperlimit , color = "red", linetype = "dashed") + 
  geom_hline(yintercept = parameter.list$gene.lowerlimit, color = "red", linetype = "dashed") + 
  xlab("Transcript counts (n/cell)") + ylab("Gene counts (n/cell)") + 
  theme_classic() + labs(title = "Transcript vs Gene Counts", caption = "dashed red line: gene count thresholds\nred points: high mt content.")

# generate ranking plots
plt.umi.rank.mito <- df.meta.qc %>%
  ggplot(aes(x = rank.umi, y = (nCount_RNA))) + 
  geom_point(color = df.meta.qc$mitoColor) + 
  scale_y_continuous(trans='log10') + 
  xlab("Cell Rank") + ylab("Transcript counts (n/cell)") + 
  theme_classic() + labs(title = "Transcript count ranking", caption = "red points: high mt content.")

plt.gene.rank.mito <- df.meta.qc %>%
  ggplot(aes(x = rank.gene, y = (nFeature_RNA))) + 
  geom_point(color = df.meta.qc$mitoColor) + scale_y_continuous(trans='log10') + 
  geom_hline(yintercept = parameter.list$gene.upperlimit , color = "red", linetype = "dashed") + 
  geom_hline(yintercept = parameter.list$gene.lowerlimit, color = "red", linetype = "dashed") + 
  xlab("Cell Rank") + ylab("Gene counts (n/cell)") + 
  theme_classic() + labs(title = "Gene count ranking", caption = "dashed red line: gene count thresholds\nred points: high mt content.")

plt.mt.rank.mito <- df.meta.qc %>%
  ggplot(aes(x = rank.mt, y = (percent.mt))) + 
  geom_point(color = df.meta.qc$mitoColor) + scale_y_continuous(trans='log10') + 
  geom_hline(yintercept = parameter.list$mt.upperlimit, color = "red", linetype = "dashed")  + 
  xlab("Cell Rank") + ylab("Mitochondrial Content (%/cell)") + 
  theme_classic() + labs(title = "Mitochondrial content ranking", caption = "red: high mt content.")


if (input.type == 1){
  
  unmatch.median <- median(df.meta.qc$unmatched.rate)
  unmatch.mad <- mad(df.meta.qc$unmatched.rate, unmatch.median)
  
  if (parameter.list$unmatched.rate.filter.flag){
    unmatch.high <- unmatch.median + (1.96*unmatch.mad)
    unmatch.low <- unmatch.median - (1.96*unmatch.mad)
  } else {
    unmatch.high <- 1
    unmatch.low <- 0
  }

  df.meta.qc$matchOmit <- (df.meta.qc$unmatched.rate > unmatch.high) | (df.meta.qc$unmatched.rate < unmatch.low)
  df.meta.qc$matchColor <- "black"
  df.meta.qc$matchColor[df.meta.qc$matchOmit] <- "skyblue"
  
  plt.unmatch.umi <- df.meta.qc %>%
    ggplot(aes(x = nCount_RNA, y = (unmatched.rate))) + 
    geom_point(color = df.meta.qc$matchColor)  + 
    xlab("Transcript Count") + ylab("Unmatched Rate") + 
    geom_hline(yintercept = unmatch.high, color = "red", linetype = "dashed")  + 
    geom_hline(yintercept = unmatch.low, color = "red", linetype = "dashed")  + 
    theme_classic() + labs(title = "Transcript count vs. unmatched rate", caption = "blue: unmatched rate outliers")
  
  df.meta.qc$unmatch.rank <- rank(-df.meta.qc$unmatched.rate)
  plt.umatch.rank <- df.meta.qc %>%
    ggplot(aes(x = unmatch.rank, y = (unmatched.rate))) + 
    geom_point(color = df.meta.qc$matchColor) + 
    scale_y_continuous(trans='log10') + 
    geom_hline(yintercept = unmatch.high, color = "red", linetype = "dashed")  + 
    geom_hline(yintercept = unmatch.low, color = "red", linetype = "dashed")  + 
    xlab("Cell Rank") + ylab("Unmatched Rate") + 
    theme_classic() + labs(title = "Unmatched rate ranking", caption = "blue: unmatched rate outliers")
  
  
  plt.gene.umi.match <- df.meta.qc %>%
    ggplot(aes(x = nCount_RNA, y = (nFeature_RNA))) + 
    geom_point(color = df.meta.qc$matchColor) + 
    scale_y_continuous(trans='log10') + 
    scale_x_continuous(trans='log10') + 
    geom_hline(yintercept = parameter.list$gene.upperlimit , color = "red", linetype = "dashed") + 
    geom_hline(yintercept = parameter.list$gene.lowerlimit, color = "red", linetype = "dashed") + 
    xlab("Transcript counts (n/cell)") + ylab("Gene counts (n/cell)") + 
    theme_classic() + labs(title = "Transcript vs Gene Counts", caption ="blue: unmatched rate outliers")
  
  # generate ranking plots
  plt.umi.rank.match <- df.meta.qc %>%
    ggplot(aes(x = rank.umi, y = (nCount_RNA))) + 
    geom_point(color = df.meta.qc$matchColor) + 
    scale_y_continuous(trans='log10') + 
    xlab("Cell Rank") + ylab("Transcript counts (n/cell)") + 
    theme_classic() + labs(title = "Transcript count ranking", caption = "blue: unmatched rate outliers")
  
  plt.gene.rank.match <- df.meta.qc %>%
    ggplot(aes(x = rank.gene, y = (nFeature_RNA))) + 
    geom_point(color = df.meta.qc$matchColor) + scale_y_continuous(trans='log10') + 
    geom_hline(yintercept = parameter.list$gene.upperlimit , color = "red", linetype = "dashed") + 
    geom_hline(yintercept = parameter.list$gene.lowerlimit, color = "red", linetype = "dashed") + 
    xlab("Cell Rank") + ylab("Gene counts (n/cell)") + 
    theme_classic() + labs(title = "Gene count ranking", caption = "blue: unmatched rate outliers")
  
  plt.mt.rank.match <- df.meta.qc %>%
    ggplot(aes(x = rank.mt, y = (percent.mt))) + 
    geom_point(color = df.meta.qc$matchColor) + scale_y_continuous(trans='log10') + 
    geom_hline(yintercept = parameter.list$mt.upperlimit, color = "red", linetype = "dashed")  + 
    xlab("Cell Rank") + ylab("Mitochondrial Content (%/cell)") + 
    theme_classic() + labs(title = "Mitochondrial content ranking", caption = "blue: unmatched rate outliers")
  
} else {
  plt.umatch.rank <- NULL
  plt.unmatch.umi <- NULL
  unmatch.high <- 1
  unmatch.low <- 0
}

if (parameter.list$print.inline) {
 cowplot::plot_grid(plt.gene.umi.mito, plt.umi.rank.mito, plt.gene.rank.mito, plt.mt.rank.mito, ncol = 4, align = "h") 
  
  if (input.type == 1){
    ggpubr::ggarrange(plt.umi.rank.match, plt.gene.rank.match, plt.mt.rank.match,
                      plt.unmatch.umi, plt.gene.umi.match, plt.umatch.rank,ncol = 3, nrow = 2) 
  }
}



```


```{r c12 - Density plot of % mitochondrial genes, warning = FALSE, include = FALSE}

# MITOCHONDRIAL CONTENT DISTRIBUTIONS

calc_mito_content_by_bc <- function(so, set_name, bc_info_flag, input.type = 1){
  
  
  if (bc_info_flag == TRUE & input.type != 2){   
    # Compute mitochrondrial content for each cell type
    
    # get unique cell types
    barcodes <-unique(so@meta.data[["Barcode"]])
    
    # create dataframe to store mitochondrial content, stratified by cell type
    df_by_bc <- NULL
    
    # for each cell type
    for (i in 1:length(barcodes)) {
      percent.mt_by_bc <- as.vector(so$percent.mt[so$Barcode == barcodes[i]])
      count.feature_by_bc <- as.vector(so$nFeature_RNA[so$Barcode == barcodes[i]])
      count.rna_by_bc <- as.vector(so$nCount_RNA[so$Barcode == barcodes[i]])
      n_cells <- length(percent.mt_by_bc)
      cur_length <- dim(df_by_bc)[1]
      
      # get QC parameters for each cell sample
      df_by_bc <- bind_rows(
        df_by_bc,
          data.frame(barcode = barcodes[i],
                     percent.mt = percent.mt_by_bc,
                     count.feature = count.feature_by_bc,
                     count.rna = count.rna_by_bc)
      )
      
    }
  } else { # if no cell type information provided, pool everything together
    df_by_bc <- data.frame()
    percent.mt_by_bc <- as.vector(so$percent.mt)
    count.feature_by_bc <- as.vector(so$nFeature_RNA)
    count.rna_by_bc <- as.vector(so$nCount_RNA)
    df_by_bc <- data.frame(percent.mt_by_bc, count.feature_by_bc, count.rna_by_bc)
    colnames(df_by_bc) <- c("percent.mt", "count.feature", "count.rna")
    df_by_bc["barcode"] <-"unspecified"
    barcodes <- "unspecified"
    so@meta.data[["Barcode"]] <-  "unspecified"
  }
  
  # Create summary graphs 
  # plot density plot
  plt.handle <- ggplot(df_by_bc, aes(percent.mt, color=barcode)) + 
    geom_density(size = 1) + 
    ggtitle(paste(set_name)) + 
    xlab("Mitochrondrial Content (%)")
  
  output <- list(so, plt.handle)
  return(output)
}

# calculate mitochondrial content, stratified by barcode (if available)
output_by_bc <- calc_mito_content_by_bc(so, parameter.list$set_names, (length(total_sets)>1), input.type)

# retrive results
so <- output_by_bc[[1]]
plt.mito_by_bc <- output_by_bc[[2]] 

# show mito filter threhsold
if ("mt.upperlimit" %in% names(parameter.list)){
  if (parameter.list$mt.upperlimit >=0 & parameter.list$mt.upperlimit <= 100){
    plt.mito_by_bc <- plt.mito_by_bc + geom_vline(xintercept = parameter.list$mt.upperlimit, linetype = "dashed")
  }
}

if (parameter.list$print.inline){
  print(plt.mito_by_bc)
}



```


```{r c13 - qc_violin_plots, fig.height= 5, fig.width = 12, include = FALSE}

if (length(unique(so@meta.data[["subset_group"]]))> 1){
  df.qc <- data.frame(so@meta.data[["subset_group"]],  
                      so@meta.data[["nFeature_RNA"]], 
                      so@meta.data[["nCount_RNA"]], 
                      so@meta.data[["percent.mt"]]) 
  colnames(df.qc) <- c("group", "nFeature_RNA", "nCount_RNA", "percent.mt")
  
  df.qc_sum <- df.qc %>%
    group_by(group) %>%
    summarize(nFeature_mean = round(mean(nFeature_RNA),2),
              nCount_mean = round(mean(nCount_RNA),2),
              percent.mt_mean = round(mean(percent.mt),2), 
              nFeature_median = round(median(nFeature_RNA),2),
              nCount_median = round(median(nCount_RNA),2),
              percent.mt_median = round(median(percent.mt),2))
  
  nFeat_aov <- summary(aov(nFeature_RNA ~ group, data = df.qc))
  nCount_aov <- summary(aov(nCount_RNA ~ group, data = df.qc))
  mT_aov <- summary(aov(percent.mt ~ group, data = df.qc))
} else {
  df.qc <- data.frame()
  df.qc_sum <- data.frame()
  nFeat_aov <- c()
  nCount_aov <- c()
  mT_aov <- c()
}

violin.list <- QC.violinPlot(so, plt.log.flag = T)
plt.QC_violin <- violin.list[[1]]
plt.violin_by_subgroup <- violin.list[[2]]

if (parameter.list$print.inline){
  print(plt.QC_violin)
  print(plt.violin_by_subgroup)
}


```


```{r c14 - QC scatterplots, fig.height= 5, fig.width = 12, include = FALSE}

# Generate QC scatterplots
max.cells <- 40000
if (ncol(so) > max.cells) {
  plt.QC_scatter <- QC.scatterPlot(downsampleSeurat(so,  subsample.n = max.cells))
} else {
  plt.QC_scatter <- QC.scatterPlot(so)
}

if (parameter.list$print.inline)  {
  print(plt.QC_scatter)
}

```


```{r c15 - filter_QC, fig.width = 5, fig.height = 5, include = FALSE}

if (length(parameter.list$set_names) == 2) {
  set.n <- NULL
} else {
  set.n <- parameter.list$set_names
}
filter_output_list <- filterSeurat(so = so, 
                                   RNA.upperlimit = parameter.list$gene.upperlimit, 
                                   RNA.lowerlimit = parameter.list$gene.lowerlimit, 
                                   mt.upperlimit = parameter.list$mt.upperlimit, 
                                   unmatch.low = unmatch.low, 
                                   unmatch.high = unmatch.high,
                                   set_names = set.n)
so <- filter_output_list[["seurat"]]
plt.filter_pre_post <- filter_output_list[["plot"]]
filter.breakdown.list <- filter_output_list[["filter.breakdown"]]

rm(filter_output_list)


plt.filter.venn <- NULL
try({
  
  plt.filter.venn <- ggVennDiagram::ggVennDiagram(filter.breakdown.list)
  # plt.filter.venn <- gplots::venn(filter.breakdown.list, show.plot = F, col = gplots::rich.colors(palette="temperature", 3))
}, silent = T)


if (parameter.list$print.inline){
  print(plt.filter.venn)
  print(plt.filter_pre_post)
}


```


```{r c16 - normalize_data, warning = FALSE, include = FALSE}

# Run data normalization
normalization_method_ps <- "SCT" # "NFS" or "SCT"

enable.parallelization <- F
if ("conserve.memory" %in% names(parameter.list) ){
  if (is.logical(parameter.list$conserve.memory)){
    conserve.memory <- parameter.list$conserve.memory
  } else {
    conserve.memory <- F
  }
} else {
  conserve.memory <- F
}

if (n.workers$normScale == 1) enable.parallelization <- F

parameter.list$clip.range <- c(-sqrt(x = ncol(x = so[["RNA"]])/30), sqrt(x = ncol(x =
    so[["RNA"]])/30))
# if (parameter.list$clip.range > 10){
#   parameter.list$clip.range <- 10
# }

so<- scNormScale(so, gNames.list, normalization_method_ps, vars2regress = parameter.list$vars2regress,
                 n.workers = n.workers$normScale,  max.memory = (max.memory*20480/20 * 1024^2), 
                 enable.parallelization = enable.parallelization, conserve.memory = conserve.memory,
                 clip.range = parameter.list$clip.range)

# plot variable genes
var.gene.orig <- so@assays[["SCT"]]@var.features
plt.variable_genes <- variableGenes.Plot(so, gNames.list, parameter.list$set_names)
plt.variable_genes <- plt.variable_genes + 
  labs(title = "Variably-Expressed Genes") + 
  theme_miko(legend = T) + 
  theme(legend.position = "bottom")

if (parameter.list$print.inline) {
  
  print(plt.variable_genes)
}

```


```{r c17 - per cell statistics, include = FALSE}

# ptm <- proc.time()
df_nCount <- data.frame(so@meta.data[["nCount_RNA"]], so@meta.data[["nCount_SCT"]])
colnames(df_nCount) <- c("UMI/cell (raw)",  "UMI/cell (normalized)")
df_nCount_m <- melt(df_nCount)

df_nFeat <- data.frame(so@meta.data[["nFeature_RNA"]], so@meta.data[["nFeature_SCT"]])
colnames(df_nFeat) <- c("genes/cell (raw)","genes/cell (normalized)")
df_nFeat_m <- melt(df_nFeat)

plt.UMI_per_cell <- ggplot(df_nCount_m, aes(x = value, fill = variable)) + 
  geom_density(alpha = 0.3) + scale_x_log10()+  xlab("UMI/cell")
plt.genes_per_cell <- ggplot(df_nFeat_m, aes(x = value, fill = variable)) + 
  geom_density(alpha = 0.3) + scale_x_log10()+   xlab("genes/cell")

plt.pre_post_1 <- FeatureScatter(so, feature1 = "nCount_RNA", feature2 = "nCount_SCT") + 
  xlab("UMI/cell (raw)") + ylab("UMI/cell (normalized)")
plt.pre_post_2 <- FeatureScatter(so, feature1 = "nFeature_RNA", feature2 = "nFeature_SCT") + 
  xlab("Genes/cell (raw)") + ylab("Genes/cell (normalized)")

plt.pre_post_norm <- CombinePlots(plots = list(plt.pre_post_1, plt.pre_post_2), ncol = 2, legend = 'none')

if (parameter.list$print.inline){
  print(plt.UMI_per_cell)
  print(plt.genes_per_cell)
  print(plt.pre_post_norm)
}

```

```{r deviance feature selection, include = FALSE}


# feature selection using deviance #############################################

# get deviant features using subset of data ####################################

if (parameter.list$feature.select.method == "deviance"){
  
  miko_message("Performing feature selection using deviance scores...")
  
  DefaultAssay(so) <- "SCT"
  so.sub <- downsampleSeurat(object = so, subsample.n = 20000)
  
  m <- GetAssayData(so.sub, slot = "data", assay = "SCT")
  m <- m[rownames(m) %in% rownames(so.sub), ]
  m <- as.matrix(m)
  batch.factor <- as.factor(so.sub@meta.data$Barcode)
  
  # system.time({
    devs <- scry::devianceFeatureSelection(object = m, batch = batch.factor, fam = "binomial")
  # })
  
  df.dev <- data.frame(
    gene = rownames(m),
    d = devs,
    is.var = rownames(m) %in% var.gene.orig
  ) %>% dplyr::arrange(-d)
  
  df.dev$d.norm <- df.dev$d/sum(df.dev$d)
  df.dev$d.cumsum <- cumsum(df.dev$d.norm )
  
  dev.gene <- df.dev$gene[df.dev$d.cumsum < 0.8]
  if (length(dev.gene) > 2000){
    dev.gene <- df.dev$gene[1:2000]
  }
  
  dev.gene <- dev.gene[!grepl("MT", toupper(gNames.list[dev.gene]))]
   so@assays[["SCT"]]@var.features <- dev.gene
  
} else if (parameter.list$feature.select.method == "hvg"){
  
  dev.gene <- var.gene.orig[!grepl("MT", toupper(gNames.list[var.gene.orig]))]
  so@assays[["SCT"]]@var.features <- dev.gene
  
}


```

```{r null model residuals, include = FALSE}

if (parameter.list$scale.method == "null_residuals"){
  
  
  m <-  GetAssayData(so, slot = "data", assay = "SCT")
  m <- m[rownames(m) %in% dev.gene, ]
  so[["DEV"]] <- CreateAssayObject(counts = m)
  m <- as.matrix(m)
  batch.factor <- as.factor(so@meta.data$Barcode)
  
  # system.time({
    m.deviance <- scry::nullResiduals(
      object = m,
      fam = c("binomial"),
      type = c("deviance"),
      batch = batch.factor
    )
  # })
  
  m.deviance[is.na(m.deviance)] <- 0
  rownames(m.deviance) <- rownames(m);
  colnames(m.deviance) <- colnames(m)
  invisible({gc()})
  
  
  m.deviance[m.deviance < parameter.list$clip.range[1]] <- parameter.list$clip.range[1]
  m.deviance[m.deviance > parameter.list$clip.range[2]] <- parameter.list$clip.range[2]
  
  # parameter.list$clip.range
  so@assays[["DEV"]]@scale.data <- m.deviance
  so@assays[["DEV"]]@var.features <- dev.gene
  
  DefaultAssay(so) <- "DEV" 
  
  rm(so.sub)
  rm(m);
  rm(m.deviance);
  invisible({gc() })
  
}

```

```{r identify artefact genes, include = FALSE}

u.bc <- unique(so@meta.data$Barcode)

if (length(u.bc) > 2){
  
  # var.gene <- var.gene.orig
  var.gene <- so@assays[[DefaultAssay(so)]]@var.features
  
  miko_message("Identifying artifact genes...")
  ag.res <-  findArtifactGenes(object = so, assay = "RNA", features = var.gene, 
                               meta.feature = "Barcode", umi.count.threshold = 5, difference.threshold = 30, verbose = T)
  plt.ag <- ag.res[["plot"]]
  
  try({
    gene.rep <- checkGeneRep(reference.genes = gNames.list, query.genes = plt.ag[["data"]][["gene"]])
    if (gene.rep == "ensembl"){
      plt.ag[["data"]][["gene"]] <- as.character(gNames.list[plt.ag[["data"]][["gene"]]])
      plt.ag[["layers"]][[2]][["data"]][["gene"]] <- as.character(gNames.list[plt.ag[["layers"]][[2]][["data"]][["gene"]]])
    }      
  }, silent = T)
  
  if (parameter.list$correct.artifact){
    miko_message(paste0(length(ag.res$artifact.gene), " artifact genes omitted from PCA..."))
    var.gene.update <- var.gene[!c(var.gene %in% ag.res$artifact.gene)]
    artifact.gene <- ag.res$artifact.gene
    
    # var.gene.update <- var.gene[!c(var.gene %in% artifact.gene)]
    # expr.cell <- getExpressedGenes(object = so, min.pct = 0.1, group = NA, group.boolean = "OR")
    # var.gene.update <- intersect(var.gene.update, expr.cell)
    
    
    # var.gene.update <- var.gene.update[!(var.gene.update %in% artifact.gene)]
    var.gene.update <- var.gene.update[!grepl("MT", toupper(gNames.list[var.gene.update]))]
    so@assays[[DefaultAssay(so)]]@var.features <- var.gene.update
    
    
  }
  
  if (parameter.list$print.inline) print(plt.ag)
  
}  else {
  var.gene <- so@assays[[DefaultAssay(so)]]@var.features
  var.gene.update <- var.gene[!grepl("MT", toupper(gNames.list[var.gene]))]
}


if (parameter.list$scale.method == "sct"){
  so <- GetResidual(so, features = so@assays[[DefaultAssay(so)]]@var.features, verbose = F)
}


# dev.gene <- dev.gene[!(dev.gene %in% artifact.gene)]

# DefaultAssay(so) <- "SCT"
# plt.ag
```



```{r c18 - preliminary umap}

# scale.byscale.by.mad
# if (parameter.list$scale.method == "sct"){
  n.dim.all <- 50
# } else if (parameter.list$scale.method == "null_residuals"){
#   n.dim.all <- 50
# }

# Run pca
if (parameter.list$pca.method == "pca"){
  miko_message("Running PCA...")
  so <- RunPCA(so, verbose = FALSE, features = so@assays[[DefaultAssay(so)]]@var.features, 
               npcs = n.dim.all)
} else if (parameter.list$pca.method == "glmpca"){
  miko_message("Running GLM-PCA...")
  so <- SeuratWrappers::RunGLMPCA(
    object = so,
    L = n.dim.all,
    assay = NULL,
    features = so@assays[[DefaultAssay(so)]]@var.features,
    reduction.name = "pca",
    reduction.key = "PC_",
    verbose = F,
    minibatch = "stochastic")
}

# specify number of PCA components to use for downstream analysis
pca.var.threshold <- parameter.list$pca.var.threshold
pca.prop <- propVarPCA(so)


# mat <- Seurat::GetAssayData(so, assay = DefaultAssay(so), slot = "scale.data")
# mat <- mat[rownames(mat) %in% so@assays[[DefaultAssay(so)]]@var.features, ]
# pca <- so[["pca"]]

# # Get the total variance:
# total_variance <- sum(matrixStats::rowVars(mat))
# eigValues = (pca@stdev)^2  ## EigenValues
# varExplained = eigValues / total_variance
# pca.prop$ve <- varExplained
# pca.prop$cve <- cumsum(pca.prop$ve)
# rm(mat); rm(pca); 
# invisible({gc()})

# pca.threshold.method <- "elbow"

pca.elbow.low <- 0.015
if (parameter.list$pca.component.select.method == "elbow"){
  nDim <- pcaElbow(pca.prop$pc.prop_var, low = pca.elbow.low, max.pc = 0.9)
} else if (parameter.list$pca.component.select.method == "cum_var"){
  nDim <- max(pca.prop$pc.id[pca.prop$pc.cum_sum% dplyr::filter(auc > 0.95)
# projectReduction(so, n.components = 10, show.n.features = 30)
```

```{r c19 - PCA}

pc.elbow_min_difference_ps <- 0.001
pc.cum_var_explained_threshold_ps <- pca.var.threshold
PC_number_method_ps <- 2

# calculate proportion of variance explained by each PC ---------------------
pc.std <- so@reductions[["pca"]]@stdev
pc.var <- pc.std^2
pc.prop_var <- pc.var/sum(pc.var)
pc.cum_sum <- cumsum(pc.prop_var)
pc.id <- c(1:ncol(so@reductions[["pca"]]))
scree.var <- data.frame(pc.id, pc.prop_var, pc.cum_sum)

# determine number of PCs included ------------------------------------------
pc.nPC_method_1 <- sum(abs(diff(pc.prop_var))>pc.elbow_min_difference_ps) +1 
pc.nPC_method_2 <- sum(pc.cum_sum% 
  dplyr::group_by(cluster_membership, barcode_labels) %>%
  tally() %>% 
  dplyr::mutate(freq = n / sum(n)) 

u_barcodes <- unique(df.all_barcodes$barcode_labels)

if (length(u_barcodes) > 1) {
  
  # convert long to wide (cell type table)
  df_for_wide_barcodes <- df.all_barcodes
  colnames(df_for_wide_barcodes)[colnames(df_for_wide_barcodes) == "barcode_labels"] <- "predicted_labels"
  df.all_barcodes_wide <- long2wide(df_for_wide_barcodes)
  
  # plot cluster compositio by barcode
  df.cluster_annotations_barcodes <- df.all_barcodes
  colnames(df.cluster_annotations_barcodes)[colnames(df.cluster_annotations_barcodes) == "barcode_labels"] <- "predicted_labels"
  
  plt.cluster_composition_barcodes <- plot.cluster_composition_alt(df.cluster_annotations_barcodes, 
                                                                   other_threshold = 0)
  
  plt.cluster_composition_barcodes <- plt.cluster_composition_barcodes + 
    theme_miko(legend = T) + 
    ggthemes::scale_fill_ptol("Barcode") + 
    labs(title = "Cluster Composition", subtitle = "Stratified by Barcodes")
} else {
  df_for_wide_barcodes <- data.frame()
  df.all_barcodes_wide <- data.frame()
  df.cluster_annotations_barcodes <- data.frame()
  plt.cluster_composition_barcodes <- c()
  
}

if (parameter.list$print.inline) {
  print(plt.cluster_composition_barcodes)
}

```



```{r c20 - cell cycle scoring, fig.width=14, fig.height = 4}

# get cell cycle genes
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

if (length(unique(parameter.list$organism.include)) > 1){
  s.genes <-  toupper(s.genes)
  g2m.genes <-  toupper(g2m.genes)
} else if (unique(parameter.list$organism.include) == "Mm"){
  s.genes <-  firstup(s.genes)
  g2m.genes <-  firstup(g2m.genes)
} else if (unique(parameter.list$organism.include) == "Hs"){
  s.genes <-  toupper(s.genes)
  g2m.genes <-  toupper(g2m.genes)
}

so <- ens2sym.so(so, gNames.list)

# cell cycle scoring and umap
plt.cc.umap <- tryCatch({
  
  so <- CellCycleScoring(so, s.features = s.genes, g2m.features = g2m.genes, set.ident = F)
  
  plt.cc.umap <- cluster.UMAP(
    so,
    pt.size = scMiko::autoPointSize(nrow(so), scale.factor = 1583),
    group.by = "Phase",
    x.label = "UMAP 1",
    y.label = "UMAP 2",
    plot.name = "Cell Cycle Classifications",
    include.labels = F) 
  
}, error = function(e){
  plt.cc.umap <- NULL
  return(plt.cc.umap)
})

# cluster-level composition
plt.cc <- NULL
if ("Phase" %in% names(so@meta.data)){
  df.cc <- data.frame(clusters =  so@meta.data[["seurat_clusters"]], phase =  so@meta.data[["Phase"]])
  
  df.cc.tally <- df.cc %>%
    group_by(clusters, phase) %>%
    tally()
  
  plt.cc.compo <- df.cc.tally %>%
    ggplot(aes(x = clusters, y = n, fill = phase)) + 
    geom_bar(position="fill", stat="identity") + 
    theme_miko(legend = T) + 
    labs(title = "Cluster Composition by Cell Cycle") + 
    xlab("Cluster ID") + ylab("Relative Frequency")
  
  # plt.umap_by_cluster + theme(legend.position = "none"), 
  
  plt.cc <- cowplot::plot_grid(plt.cc.umap + ggthemes::scale_colour_tableau(), 
                               plt.cc.compo + ggthemes::scale_fill_tableau(), nrow = 1, labels = c("F", "G"))
} 


if (parameter.list$print.inline) {
  print(plt.cc)
}

```
```{r c21 - variable gene list}

if (normalization_method_ps == "SCT"){
  df.var.meta <- so@assays[["SCT"]]@meta.features
  df.var.model <- so@assays[["SCT"]]@SCTModel.list[["model1"]]@feature.attributes
  var.gene <- so@assays[["SCT"]]@var.features
  df.var.model$ENSEMBLE <- df.var.meta$ENSEMBLE
  df.var.model$SYMBOL <- df.var.meta$SYMBOL
  so@assays[["SCT"]]@SCTModel.list[["model1"]]@feature.attributes <- df.var.model
  df.var.meta <- df.var.model %>% dplyr::filter(SYMBOL %in% var.gene)
  # df.var.meta <- df.var.meta[df.var.meta$sct.variable == T, ]
  # df.var.meta <- bind_cols(data.frame(genes = rownames(df.var.meta)), df.var.meta)
  df.var.meta <- df.var.meta %>% dplyr::arrange(-residual_variance)
  
  which.round <- colnames(df.var.meta)[ !(colnames(df.var.meta) %in% c("ENSEMBLE", "SYMBOL", "genes_log_gmean_step1", "step1_theta", "step1_(intercept)", "step1_log_umi"))]
  try({df.var.meta[which.round] <- signif(df.var.meta[which.round], 3)}, silent = T)
} else {
  df.var.meta <- NULL
}




```

```{r c22 - run additional dimensional reductions}

# 
# 
# if (parameter.list$do.additional.reduction){
#   
#   # tsne #####################
#   
#   plt.tsne.combo <- tryCatch({
#     so <-RunTSNE(so, dims = 1:nDim)
#     
#     # generate plots
#     plt.tsne_by_cluster <- DimPlot(so, reduction = "tsne", label = TRUE)  + ggtitle(label = "TSNE") +
#       xlab("TSNE 1") + ylab("TSNE 2")
#     
#     plt.tsne <- DimPlot(so, reduction = "tsne",group.by = "Barcode")  + ggtitle(label = "TSNE") +
#       xlab("TSNE 1") + ylab("TSNE 2")
#     
#     # combine plots
#     plt.tsne.combo <- cowplot::plot_grid(plt.tsne_by_cluster, plt.tsne, ncol = 2)
#   }, error = function(e){
#     plt.tsne.combo <- NULL
#     return(plt.tsne.combo)
#   })
#   
#   
#   # PCA ########################
#   
#   # generate plots
#   plt.pca_by_cluster <- DimPlot(so, reduction = "pca", label = TRUE)  + ggtitle(label = "PCA") +
#     xlab("PCA 1") + ylab("PCA 2")
#   
#   plt.pca <- DimPlot(so, reduction = "pca",group.by = "Barcode")  + ggtitle(label = "PCA") +
#     xlab("PCA 1") + ylab("PCA 2")
#   
#   # combine plots
#   plt.pca.combo <- cowplot::plot_grid(plt.pca_by_cluster, plt.pca, ncol = 2)
#   
#   # ICA ########################
#   so <- RunICA(so, verbose = F)
#   
#   # generate plots
#   plt.ica_by_cluster <- DimPlot(so, reduction = "ica", label = TRUE)  + ggtitle(label = "ICA") +
#     xlab("ICA 1") + ylab("ICA 2")
#   
#   plt.ica <- DimPlot(so, reduction = "ica",group.by = "Barcode")  + ggtitle(label = "ICA") +
#     xlab("ICA 1") + ylab("ICA 2")
#   
#   # combine plots
#   plt.ica.combo <- cowplot::plot_grid(plt.ica_by_cluster, plt.ica, ncol = 2)
#   
# } else {
#   plt.tsne.combo <- NULL
#   plt.ica.combo <- NULL
#   plt.pca.combo <- NULL
# }
# 
# 
# if (parameter.list$print.inline){
#   print(plt.tsne.combo)
#   print(plt.pca.combo)
#   print(plt.ica.combo)
# }
```




```{r c23 - summary statistics, warning = FALSE}

meta.data <- names(so@meta.data)

# get subgroup names
quantifier.stem <- c("Count", "Feature", "percent.mt")
which.quant <- grepl(paste(quantifier.stem, collapse = "|"), meta.data)
quant.names <- meta.data[which.quant]
meta.names <- meta.data[!which.quant]

df.meta <- so@meta.data

summary.list <- list()

for (i in 1:length(meta.names)){
  
  current.meta.name <- meta.names[i]
  if (current.meta.name %in% c("unmatched.rate", "S.Score", "G2M.Score")) next
  
  df.meta.summary <- NULL
  
  df.meta.summary <- df.meta %>%
    group_by(get(meta.names[i])) %>%
    dplyr::summarise(n.nuclei =n(),
              UMI.mean = round(mean(nCount_RNA)),
              UMI.median =  round(median(nCount_RNA)),
              UMI.min = min(nCount_RNA),
              UMI.max = max(nCount_RNA),
              genes.mean =  round(mean(nFeature_RNA)),
              genes.median =  round(median(nFeature_RNA)),
              genes.min = min(nFeature_RNA),
              genes.max = max(nFeature_RNA),
              mito.mean =  signif(mean(percent.mt),3) ,
              mito.median =  signif(median(percent.mt),3) ,
              mito.min =  signif(min(percent.mt),3) ,
              mito.max =  signif(max(percent.mt),3))
  
  colnames(df.meta.summary)[1] <- meta.names[i]
  
  if (dim(df.meta.summary)[1] == 1){
    df.meta.summary <- data.frame(t(df.meta.summary)[2:dim(df.meta.summary)[2], ])
    colnames(df.meta.summary)[1] <- meta.names[i]
  } else {
    df.meta.summary <- as.data.frame(t(df.meta.summary))
    new.col.name <- as.matrix(df.meta.summary[1,])
    df.meta.summary <- as.data.frame(df.meta.summary[2:nrow(df.meta.summary),])
    colnames(df.meta.summary) <- new.col.name
  }
  
  try({
    summary.list[[meta.names[i]]] <- df.meta.summary
  }, silent = T)
  
}

```
```{r central log}

# update central log
run.id <- NULL
if (!exists("user")) user <- "guest"

clog.update.success <- F
if (parameter.list$update.log){
  try({
  run.id <-  updateCentralLog(Module = "M01", input.data = which.data, input.subset = which.strata, pdf.flag = parameter.list$save.pdf)
  clog.update.success <-  T
}, silent = F)
}

if (is.null(run.id))  {
  warning("Central log update was unsuccessful :(\n")
  run.id <- paste("M01_", user, "_r", paste0(format(Sys.time(), '%s')), sep = "", collapse = "")
}

```



```{r analysis log}

# Normalization Method #
if (normalization_method_ps == "NFS"){
  norm_method <- "normalize, find features, scale"
} else if (normalization_method_ps == "SCT"){
  norm_method <- "SCT"
}

df.log <- addLogEntry("Normalization Method", norm_method, df.log, "norm_method")
df.log <- addLogEntry("PCA Method", parameter.list$pca.method, df.log, "pca.method")
df.log <- addLogEntry("Principal Components Included", nDim, df.log, "nDim")
df.log <- addLogEntry("Unmatched rate upper limit", unmatch.high, df.log, "unmatch.high")
df.log <- addLogEntry("Unmatched rate lower limit", unmatch.low, df.log, "unmatch.low")
df.log <- addLogEntry("nPCA components used", nDim, df.log, "nDim")

end.time <- proc.time()
elapsed.time <- round((end.time - start.time)[[3]], 2)
df.log <- addLogEntry("Run Time (s)", elapsed.time, df.log, "elapsed.time")
df.log <- addLogEntry("Run Identifier", run.id, df.log, "run.id")

df.log <- addLogEntry("User", user, df.log, "user")
df.log <- addLogEntry("Central Log Updated", clog.update.success, df.log, "clog.update.success")



```

```{r output path and save results}

# output path
if (!exists("data.path")) data.path = ""
output.path <- paste0(data.path, "Module_Outputs/", paste0(run.id,"_",format(Sys.time(), '%d%m%y'), "/"))

# create output directories
dir.create(output.path)
dir.create(paste0(output.path, "Tables/"))
if (parameter.list$save.pdf) dir.create(paste0(output.path, "PDF/"))

parameter.list$save.filename <- paste0(run.id, "_", parameter.list$save.filename)

df.log <- addLogEntry("Output Results (.Rdata)", as.character(parameter.list$save.filename), df.log, "save.filename")
df.log <- addLogEntry("Output Directory", as.character(parameter.list$save.directory), df.log, "save.directory")
df.log <- addLogEntry("Output Saved", as.character(parameter.list$save.flag), df.log, "save.flag")

# save results to Preprocessed_Datasets/
save.filename <- paste0(data.path, parameter.list$save.directory , parameter.list$save.filename)

df.log_Module_1 <- df.log
if ((parameter.list$save.flag == TRUE) ){
  save(so, gNames.list, df.log_Module_1, file = save.filename)
}


```



QC Overview
===================================== 

Sidebar {.sidebar }
-------------------------------------

**Description**: scRNAseq preprocessing and quality check\

**Figure Legends**:\
**A|** Distribution of gene counts, transcript counts, and mitochondrial percentage per cell.\
**B|** Relationship between QC measures. *Color* scale represents cell density.\
**C|** Distribution of mitochondial percentage per cell, stratified by samples (Barcodes).\
**D|** Number of cells pre- and post- filtering step. Percentage of omitted cells are indicated.\
**E|** Relationship between pre and post-normalized transcript and gene counts.\
**F|** Distribution of pre and post-normalized transcript counts.\
**G|** Distribution of pre and post-normalized gene counts.\
**H|** Variable gene plot.\

**Notes**:\
*For A-C*, all cells are shown (i.e., pre-filtering step)\
*For E-H*, subset of filtered cells are shown (i.e., post-filtering step)\

**Definitions**:\
**nFeature**: Number of genes.\
**nCount**: Number of transcripts.\
**percent.mt**: Percentage of mitochondrial transcripts.\
**Barcode**: Sample identifier.\

Row 
-----------------------------------------------------------------------

### A| QC distributions

```{r plt.QC_violin, fig.height= 5, fig.width = 10}

print(plt.QC_violin)

savePDF(file.name = paste0(output.path, "PDF/", "M01_QC_violin.pdf"), plot.handle = plt.QC_violin, fig.width = 10, fig.height = 5, save.flag = parameter.list$save.pdf)

```

### B| Pairwise QC relationships

```{r plt.QC_scatter, fig.height= 5, fig.width = 12}
print(plt.QC_scatter)
savePDF(file.name = paste0(output.path, "PDF/", "M01_QC_scatter.pdf"), plot.handle = plt.QC_scatter, fig.width = 12, fig.height = 5, save.flag = parameter.list$save.pdf)

```

### C| Mitochondrial content distribution 

```{r mt.mito_by_bc}
print(plt.mito_by_bc + theme_miko(legend = T))
savePDF(file.name = paste0(output.path, "PDF/", "M01_mito_by_bc.pdf"), 
        plot.handle = plt.mito_by_bc + theme_miko(legend = T), fig.width = 8, fig.height = 5, save.flag = parameter.list$save.pdf)

```

### D| Cells Recovered: Pre- and Post-Filtering

```{r plt.filter_pre_post}
print(plt.filter_pre_post + theme_miko(legend = T))
savePDF(file.name = paste0(output.path, "PDF/", "M01_filter_pre_post.pdf"), plot.handle = plt.filter_pre_post + theme_miko(legend = T), 
        fig.width = 5, fig.height = 5, save.flag = parameter.list$save.pdf)
```

Row 
-----------------------------------------------------------------------

### E| Pre- vs Post-Normalization Statistics

```{r plt.pre_post_norm}
print(plt.pre_post_norm)
savePDF(file.name = paste0(output.path, "PDF/", "M01_pre_post_norm.pdf"), plot.handle = plt.pre_post_norm, 
        fig.width = 5, fig.height = 5, save.flag = parameter.list$save.pdf)
```

### F| Transcripts per cell

```{r plt.UMI_per_cell}
print(plt.UMI_per_cell + theme_miko(legend = T))
savePDF(file.name = paste0(output.path, "PDF/", "M01_transcripts_per_cell.pdf"), plot.handle = plt.UMI_per_cell + theme_miko(legend = T), 
        fig.width = 5, fig.height = 5, save.flag = parameter.list$save.pdf)
```

### G| Genes per cell

```{r plt.genes_per_cell}
print(plt.genes_per_cell + theme_miko(legend = T))
savePDF(file.name = paste0(output.path, "PDF/", "M01_genes_per_cell.pdf"), plot.handle = plt.genes_per_cell + theme_miko(legend = T), 
        fig.width = 5, fig.height = 5, save.flag = parameter.list$save.pdf)
```


### H| Variable gene plot

```{r plt.variable_genes}
print(plt.variable_genes)
savePDF(file.name = paste0(output.path, "PDF/", "M01_variable_genes.pdf"), plot.handle = plt.variable_genes, 
        fig.width = 5, fig.height = 5, save.flag = parameter.list$save.pdf)
```

Row 
-----------------------------------------------------------------------

### Cells Recovered
```{r valuebox1}
valueBox(ncol(so))
```

### Genes per Cell (Median)
```{r valuebox2}
valueBox(median(so$nFeature_RNA))
```

### Transcripts per Cell (Median)
```{r valuebox3}
valueBox(median(so$nCount_RNA))
```

Filter and PCA Summary
===================================== 

Sidebar {.sidebar }
-------------------------------------

**Description**: QC rank plots and filtering cutoffs, along with PCA summary. 

**Figure Legends**:\
**A|** Transcript vs. gene count relationship.\
**B|** Transcript count per cell ranking.\
**C|** Gene count per cell ranking.\
**D|** Mitochondrial content per cell ranking.\
**E|** Venn diagram of filtered cells, grouped by filtering criteria.\
**F|** PCA scree plot showing proportion of variance explained by each principal component.\
**G|** Cumulative variance explained by principal components.\
**H|** Artifact gene detection (Lause 2021)

Row {.tabset}
-------------------------------------

### Filtering Statistics

```{r plt QC ranks mT, fig.width = 15, fig.height = 8}
plt.QC.stat <-  cowplot::plot_grid(cowplot::plot_grid(plt.gene.umi.mito, plt.umi.rank.mito, plt.gene.rank.mito, plt.mt.rank.mito, ncol = 2, labels = c("A", "B", "C", "D"), align = "h"), plt.filter.venn, labels = c("", "E")) 

print(plt.QC.stat)

savePDF(file.name = paste0(output.path, "PDF/", "M01_filtering_statistics.pdf"), 
        plot.handle =  plt.QC.stat, 
        fig.width = 15, fig.height = 8, save.flag = parameter.list$save.pdf)

```



### PCA Scree Plots

```{r plt.PCA, fig.width=10, fig.height = 4}

cowplot::plot_grid(plt.scree1, plt.scree2, ncol=2, labels=  c("F", "G"))

savePDF(file.name = paste0(output.path, "PDF/", "M01_scree.pdf"), plot.handle =  cowplot::plot_grid(plt.scree1, plt.scree2, ncol=2), 
        fig.width = 8, fig.height = 4, save.flag = parameter.list$save.pdf)

```


### Artifact Genes

```{r artifact gene plot, fig.height=7, fig.width=8}

try({
  plt.ag <- cowplot::plot_grid(plt.ag, labels = "H")
  print(plt.ag)
  
  savePDF(file.name = paste0(output.path, "PDF/", "M01_artifact_genes.pdf"), plot.handle =  plt.ag, 
          fig.height=7, fig.width=8, save.flag = parameter.list$save.pdf)
}, silent =T)


```

```{r plt QC ranks uR, fig.width=12, fig.height = 7}


### QC (unmatched rate)
# try({
#   cowplot::plot_grid(plt.umi.rank.match, plt.gene.rank.match, plt.mt.rank.match,
#                     plt.unmatch.umi, plt.gene.umi.match, plt.umatch.rank,ncol = 3, nrow = 2) 
# 
# }, silent = T)

```


```{r}
# try({
#   savePDF(file.name = paste0(output.path, "PDF/", "M01_QC_unmatchedRate_ranks.pdf"), 
#         plot.handle =   cowplot::plot_grid(plt.umi.rank.match, plt.gene.rank.match, plt.mt.rank.match,
#                     plt.unmatch.umi, plt.gene.umi.match, plt.umatch.rank,ncol = 3, nrow = 2) , 
#         fig.width = 17, fig.height = 10, save.flag = save.pdf)
#   
# }, silent = T)

```


UMAP
===================================== 

Sidebar {.sidebar }
-------------------------------------

**Description**: Uniform Manifold Approximation and Projection (UMAP) plots with overlaid cluster, sample, QC and cell cycle information. 

**Figure Legends**:\
**A|** Cell cluster UMAP.\
**B|** Sample UMAP.\
**C|** Transcript count (n/cell) UMAP.\
**D|** Gene count (n/cell) UMAP.\
**E|** Mitochondrial content (%/cell) UMAP.\
**F|** Cell cycle state UMAP.\
**G|** Barplot of Relative abundances of cells in each cell cycle state, stratified by cell cluster.\

Row {.tabset}
-----------------------------------------------------------------------

### Cell Clusters

```{r, fig.width=12, fig.height=5}
print(plt.umap.combo)
  savePDF(file.name =paste0(output.path, "PDF/", "M01_umap.pdf"), plot.handle =  plt.umap.combo, 
          fig.width = 12, fig.height = 5, save.flag = parameter.list$save.pdf)
  
```



### QC Statistics


```{r UMAP QC plots, fig.width=15, fig.height=6}

plt.umap.qc

savePDF(file.name = paste0(output.path, "PDF/", "M01_QC_umap.pdf"), plot.handle =  plt.umap.qc, 
        fig.width = 12, fig.height = 6, save.flag = parameter.list$save.pdf)


```

### Cell Cycle


```{r plot cc plot, fig.width=12, fig.height=5}

try({
  print(plt.cc)
  savePDF(file.name = paste0(output.path, "PDF/", "M01_cell_cycle.pdf"), plot.handle =  plt.cc, 
        fig.width = 14, fig.height = 4, save.flag = parameter.list$save.pdf)
  
}, silent = T)

```



```{r, fig.width=12, fig.height=5}

### TSNE
# print(plt.tsne.combo)
# if (!is.null(plt.tsne.combo)){
#     savePDF(file.name = paste0(output.path, "PDF/", "M01_tsne.pdf"), plot.handle =  plt.tsne.combo, 
#           fig.width = 12, fig.height = 5, save.flag = save.pdf)
# }

```



```{r, fig.width=12, fig.height=5}

### PCA
# print(plt.pca.combo)
# 
# if (!is.null(plt.pca.combo)){
#     savePDF(file.name = paste0(output.path, "PDF/", "M01_pca.pdf"), plot.handle =  plt.pca.combo, 
#           fig.width = 12, fig.height = 5, save.flag = save.pdf)
# }
```



```{r, fig.width=12, fig.height=5}

### ICA
# print(plt.ica.combo)
# 
# if (!is.null(plt.ica.combo)){
#     savePDF(file.name = paste0(output.path, "PDF/", "M01_ica.pdf"), plot.handle =  plt.ica.combo, 
#           fig.width = 12, fig.height = 5, save.flag = save.pdf)
# }

```



```{r}


# report.subgroup <- length(unique(so@meta.data[["subset_group"]]))> 1
#   a1 <- knitr::knit_expand(text = sprintf("\n%s", "QC by subgroup"))
#   a2 <- knitr::knit_expand(text = "\n=====================================")
#   out1 <- paste(a1, a2, collapse = '\n')

# `r if (report.subgroup){ paste(knitr::knit(text = paste(out1, collapse = '\n'))) }`

```




```{r}

#   a1 <- knitr::knit_expand(text = sprintf("\n%s", "TF Network"))
# a2 <- knitr::knit_expand(text = sprintf("\n%s\n", "====================================="))
# a3 <- knitr::knit_expand(text = sprintf("\n%s", "Sidebar {.sidebar}"))
# a4 <- knitr::knit_expand(text = sprintf("\n%s\n", "-------------------------------------"))
# a5 <- knitr::knit_expand(text = "\n**QC statistics by subgroup**\n\n**Description**: Data are stratified by subgroup, and QC statistics are shown.")
# 
# out2 <- paste(a3, a4,a5, collapse = '\n')

# `r if (report.subgroup){ paste(knitr::knit(text = paste(out2, collapse = '\n'))) }`

```



```{r scale free}

# r1 <- knitr::knit_expand(text = sprintf("\n%s", "Row {.tabset}"))
# r2 <- knitr::knit_expand(text = sprintf("\n%s\n", "-------------------------------------"))
# 
# row.chunk <- paste(r1, r2,collapse = '\n')
# 
# a1 <- knitr::knit_expand(text = sprintf("### %s\n", "QC distributions"))  # tab header
# a2 <- knitr::knit_expand(text = sprintf("\n```{r %s, message=FALSE, warning=FALSE, fig.width = 12, fig.height=5}",  #fig.width = 8, fig.height=8,
#                                         paste("plt.violin_by_subgroup", 1, sep = "")))
# a3 <- knitr::knit_expand(text = sprintf("\n %s", "print(plt.violin_by_subgroup)"))
# a4 <- knitr::knit_expand(text = "\n```\n") # end r chunk
# 
# a1.chunk <- paste(a1, a2, a3, a4, collapse = '\n') 
# 
# a1 <- knitr::knit_expand(text = sprintf("### %s\n", "QC table"))  # tab header
# a2 <- knitr::knit_expand(text = sprintf("\n```{r %s, message=FALSE, warning=FALSE, fig.width = 10, fig.height=5}",  #fig.width = 8, fig.height=8,
#                                         paste("df.qc_sum table", 1, sep = "")))
# a3 <- knitr::knit_expand(text = sprintf("\n %s", "knitr::kable(df.qc_sum)"))
# a4 <- knitr::knit_expand(text = "\n```\n") # end r chunk
# 
# a2.chunk <- paste(a1, a2, a3, a4, collapse = '\n') 
# 
# try({
#       savePDF(file.name = paste0(output.path, "PDF/", "general_scaleFree_parameterOptimization.pdf"), plot.handle =  plt.sft, 
#             fig.width = 10, fig.height = 5, save.flag = parameter.list$save.pdf)
# }, silent = T)
# try({
#       savePDF(file.name = paste0(output.path, "PDF/", "tf_scaleFree_parameterOptimization.pdf"), plot.handle =  plt.sft.tf, 
#             fig.width = 10, fig.height = 5, save.flag = parameter.list$save.pdf)
# }, silent = T)
# 
# `r if (report.subgroup){ paste(knitr::knit(text = paste(row.chunk, collapse = '\n'))) }`
# 
# `r if (report.subgroup){paste(knitr::knit(text = paste(a1.chunk, collapse = '\n')))}`
# 
# `r if (report.subgroup){paste(knitr::knit(text = paste(a2.chunk, collapse = '\n')))}`
# 
# 
# 
# 
# QC by subgroups (pre-filtering)
# ===================================== 
# 
# Row {data-height=700}
# -----------------------------------------------------------------------
# 
# ```{r plt.violin_by_subgroup, fig.height= 5, fig.width = 12}
# 
# print(plt.violin_by_subgroup)
# 
# savePDF(file.name = paste0(output.path, "PDF/", "M01_violin_by_subgroup.pdf"), plot.handle =  plt.violin_by_subgroup, 
#         fig.width = 12, fig.height = 5, save.flag = parameter.list$save.pdf)
# 
# ```
# 
# Row {.tabset}
# -----------------------------------------------------------------------
# 
# ### QC summary
# 
# ```{r df.qc_sum}
# knitr::kable(df.qc_sum)
# ```
# 
# ### nFeature_RNA ANOVA
# 
# ```{r nFeat_aov}
# knitr::kable(nFeat_aov[[1]])
# ```
# 
# ### nCount_RNA ANOVA
# 
# ```{r nCount_aov}
# knitr::kable(nCount_aov[[1]])
# ```
# 
# ### percent.mt ANOVA
# 
# ```{r mT_aov}
# knitr::kable(mT_aov[[1]])
# ```
# 
# 



```








```{r plt.cluster_composition_barcodes, fig.width=13, fig.height=4}

# Cluster Composition
# ===================================== 

# try({
#   print(plt.cluster_composition_barcodes)
#   savePDF(file.name = paste0(output.path, "PDF/", "M01_cluster_composition_barplot.pdf"), plot.handle = plt.cluster_composition_barcodes, 
#           fig.width = 13, fig.height = 4, save.flag = parameter.list$save.pdf)  
# }, silent = T)

```

QC Tables
===================================== 

Sidebar {.sidebar }
-------------------------------------

**Description**:\
QC statistics stratified by factors of potential interest, including cell cluster and cell cycle. 

**Definitions**\
**n.nuclei**: number of nuclei/cells\
**UMI.mean**: mean transcript count (n/cell)\
**UMI.median**: median transcript count (n/cell)\
**UMI.min**: minimum transcript count (n/cell)\
**UMI.max**: maximum transcript count (n/cell)\
**gene.mean**: mean gene count (n/cell)\
**gene.median**: median gene count (n/cell)\
**gene.min**: minimum gene count (n/cell)\
**gene.max**: maximum gene count (n/cell)\
**mito.mean**: mean mitochondrial content (%/cell)\
**mito.median**: median mitochondrial content (%/cell)\
**mito.min**: minimum mitochondrial content (%/cell)\
**mito.max**: maximum mitochondrial content (%/cell)\



Row {.tabset}
-------------------------------------


```{r summary statistis,  echo = FALSE, eval = TRUE, message=FALSE, warning=FALSE}

out <- flex.multiTabTables(summary.list, "summary.list")

```

`r paste(knitr::knit(text = paste(out, collapse = '\n')))`


Variable Genes Table
===================================== 

Sidebar {.sidebar }
-------------------------------------

**Description**:\
Results from SCtransform workflow. In brief, variance stabilizing transformation to UMI count data was applied using regularized Negative Binomial regression model to remove unwanted effects from UMI data. See Seurat::sctransform() for details. \

**Definitions**\
gmean: geometric mean\
variance: expression variance\
genes_log_gmean_step1: log-geometric mean of genes used in initial step of parameter estimation\

**Citation**:\
Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20, 296 (December 23, 2019). https://doi.org/10.1186/s13059-019-1874-1

Row 
-------------------------------------

```{r var table list}

if (!is.null(df.var.meta)){
  flex.asDT(df.var.meta)
}

```


```{r save variable gene table}
if (!is.null(df.var.meta)){
  if (parameter.list$save.pdf){
  write.csv(df.var.meta, file = paste0(output.path, "Tables/", "variableGenes.csv"), 
    row.names = F)
  }
}
```

Log
===================================== 

```{r}
knitr::kable(df.log_Module_1)

```

```{r save analysis log as csv}

try({
  write.csv(df.log_Module_1, file = paste0(output.path, "Tables/", "analysisLog.csv"), 
    row.names = F)  
}, silent = T)

```

```{r merge pdfs, include = FALSE}

# combine pdfs into single binder
# if (parameter.list$save.pdf){
#   try({
#     pdf.list <- list.files (path = paste0(output.path, "PDF/") )
#     pdf.list <- paste0( paste0(output.path, "PDF/"), pdf.list[grepl(".pdf", pdf.list)])
#     
#     pdftools::pdf_combine(pdf.list, output =  paste0(output.path, "PDF/merged_binder.pdf"))
#   }, silent = T)
# }


```